As part of Udacity’s Data Analyst Nanodegree I analyse the data set from [Cortez et al., 2009], which includes 1599 data rows related to the Portuguese “Vinho Verde”. The goal is to assess the quality of red wine based on 11 physicochemical input variables and one sensory output variable (quality).
First I look at the individual variables, then two or more variables are analysed in combination to gain insights into possible correlations. Finally, I build a model based on the most relevant variables and try to predict quality from them.
For fans of good wine - like me - it is very interesting to take this data driven approach towards one of the most delicious (alcoholic) beverages. Let’s see what we can find out.
The data set is provided in clean form as a csv file and can easily be imported into R for our analysis.
Taking a look at the structure, there are 13 variables overall. One of them is a running row number (“X”), 11 variables contain input data and 1 variable is the output “quality”.
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
When we look at the data types we can see that there are only numeric variables, including two integers (“X” and “quality”):
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
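The import and the two inspections above could be reproduced roughly as follows. This is only a minimal sketch: the file name in the comment is an assumption (the original file name is not given here), and the runnable part parses three rows copied from the output above instead of the full file.

```r
# Hypothetical file name; the real analysis would read the full csv file:
# wine <- read.csv("wineQualityReds.csv")

# Self-contained stand-in: the first three rows as shown in the str() output
# (density is rounded in that output, so those values are approximate).
csv_text <- "X,fixed.acidity,volatile.acidity,citric.acid,residual.sugar,chlorides,free.sulfur.dioxide,total.sulfur.dioxide,density,pH,sulphates,alcohol,quality
1,7.4,0.70,0.00,1.9,0.076,11,34,0.998,3.51,0.56,9.4,5
2,7.8,0.88,0.00,2.6,0.098,25,67,0.997,3.20,0.68,9.8,5
3,7.8,0.76,0.04,2.3,0.092,15,54,0.997,3.26,0.65,9.8,5"
wine_head <- read.csv(text = csv_text)

names(wine_head)  # the 13 column names
str(wine_head)    # data types per column
# Note: in this tiny sample the sulfur dioxide columns may be parsed as int,
# because all three values happen to be whole numbers.
```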
From the data source we also get information about the units:
Furthermore, we find some explanation about the variables:
The first three variables are acids which can add an unpleasant taste (high levels of volatile acidity) but also freshness (citric acid). Residual sugar determines whether a wine is considered sweet (more than 45 grams/liter). Chlorides represent the amount of salt. Then there is free sulfur dioxide and total sulfur dioxide (free and bound SO2), which prevent oxidation and only become evident at concentrations over 50 ppm (free SO2). The density depends on the percentage of alcohol and sugar. pH describes how acidic or basic a wine is (on a scale from 0 to 14). Sulphates are added to wine as an antioxidant. Finally, there is alcohol and quality. The quality score is between 0 and 10.
To get a first impression of the data, we plot the histograms:
We can see differently shaped distributions: some are bell shaped (e.g. density and pH), others have long tails (e.g. free.sulfur.dioxide) and some are very narrow (e.g. residual.sugar and chlorides).
Looking at the histogram of total.sulfur.dioxide, we can see a right-skewed distribution with a long tail to the right. Let’s look at the graph on a log scale:
With the log transformation the shape is much more bell-like. This could be useful later on.
There seem to be some outliers on the outer bounds. We can further analyse this by drawing the boxplot:
In the original data on the left we can find several outliers close to 1.5 times IQR and two far away at nearly 300 mg/dm^3. On the right we see the log transformed scale. In this case all data points are within 1.5 times IQR.
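The box plot rule used above (points more than 1.5 times the IQR away from the quartiles count as outliers) can be checked numerically. The sketch below does not use the wine data: it assumes a made-up, right-skewed stand-in for total.sulfur.dioxide, and `count_outliers` is a hypothetical helper written for this illustration.

```r
# Made-up, deterministic stand-in for total.sulfur.dioxide: evenly spaced
# quantiles of a log-normal distribution (parameters chosen arbitrarily).
n <- 1599
tsd <- qlnorm(ppoints(n), meanlog = 3.6, sdlog = 0.7)

# Count the points beyond 1.5 * IQR from the quartiles (the box plot rule).
count_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75))
  fence <- 1.5 * (q[2] - q[1])
  sum(x < q[1] - fence | x > q[2] + fence)
}

# A long right tail pulls the mean above the median.
skewed_right <- mean(tsd) > median(tsd)

outliers_raw <- count_outliers(tsd)         # original scale
outliers_log <- count_outliers(log10(tsd))  # log10 scale

c(raw = outliers_raw, log10 = outliers_log)
```

On this stand-in the log scale still leaves a few flagged points, so the exact counts will differ from the real data; the point is only that the transformation pulls the long right tail back inside the whiskers.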
Next, we can try to simplify the data by combining the three acid variables (fixed.acidity, volatile.acidity and citric.acid). Because all three are measured in g/dm^3, we can add them together and create a new variable ‘all.acids’.
The distribution of the newly created variable looks like this:
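As a small illustration of how such a combined variable can be built, the sketch below sums the three acid columns row by row. It uses only the first four rows shown in the structure output above, so `wine_head` is a stand-in for the full data frame.

```r
# First four rows of the three acid columns, copied from the output above.
wine_head <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28),
  citric.acid      = c(0.00, 0.00, 0.04, 0.56)
)

# Row-wise sum of the three acid measurements.
wine_head$all.acids <- with(wine_head,
                            fixed.acidity + volatile.acidity + citric.acid)

wine_head$all.acids
```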
Let’s have a closer look at the variable which is probably the most interesting one: quality. To get a first idea of the distribution of the data, we choose a histogram.
We can see that all wines have a quality between 3 and 8 (on a scale from 0 to 10). To verify this we can get the minimum and maximum values:
## [1] 3
## [1] 8
As expected we get 3 as minimum and 8 as maximum value for quality.
In the histogram we can see that most of the wines have a quality rating of 5 or 6. Only some have a rating of 7 and even fewer have ratings of 3, 4 or 8.
This is somewhat surprising. Intuitively the bell-shaped distribution makes sense (only some are really bad or good), but we would expect to see at least some wines on the outer bounds (e.g. 0, 1, 2, 9 and 10).
Let’s also compute some statistics for quality.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
The mean is pretty much in the center of the scale at 5.636 and the median is 6.000.
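The checks on quality shown above rely on a handful of base R functions. The sketch below applies them to the first ten quality scores from the structure output, so its numbers differ from those of the full data set.

```r
# First ten quality scores from the str() output above (not the full data).
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

range(quality)    # minimum and maximum in one call
table(quality)    # number of wines per score, the counts behind a histogram
summary(quality)  # min, quartiles, median, mean, max
```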
There are 11 relevant input variables and 1 output variable (quality) describing 1599 data rows.
The feature of interest is quality, which is a score between 0 and 10 based on sensory data. We want to assess which of the input variables have the most influence on the quality of a wine.
Some of the 11 input variables should have a correlation with the quality. The acid levels or pH values could be interesting in this sense. Also the alcohol percentage is an interesting candidate (from personal experience, higher alcohol levels seem to improve the taste). Too much SO2 could have a negative impact.
On the other hand, the spread of sugar and chlorides is quite narrow. Therefore, it seems unlikely that they have a big impact on the quality in this data set.
I created one new variable: by adding the three acid values together (all.acids), I try to simplify the data. We will see later whether this makes sense.
The data came in clean form and it doesn’t seem necessary to tidy it.
We saw that the histogram for total.sulfur.dioxide was skewed to the right. In an attempt to get more insights, we log transformed it. This revealed a bell-shaped curve. It is unclear, though, whether this will be helpful. We will keep it in mind for our further explorations.
To get an overview of the bivariate correlations we start with plotting a matrix of plots and correlations:
The highest correlation values (>0.5) that we get are:
For quality the following four variables have the highest correlation (>0.2)
##
## Pearson's product-moment correlation
##
## data: quality and alcohol
## t = 21.639, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4373540 0.5132081
## sample estimates:
## cor
## 0.4761663
##
## Pearson's product-moment correlation
##
## data: quality and volatile.acidity
## t = -16.954, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4313210 -0.3482032
## sample estimates:
## cor
## -0.3905578
##
## Pearson's product-moment correlation
##
## data: quality and sulphates
## t = 10.38, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2049011 0.2967610
## sample estimates:
## cor
## 0.2513971
##
## Pearson's product-moment correlation
##
## data: quality and citric.acid
## t = 9.2875, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1793415 0.2723711
## sample estimates:
## cor
## 0.2263725
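Output of this kind comes from `cor.test()`. The sketch below runs the same kind of test on the first ten rows shown in the structure output, so the estimate differs from the 0.476 reported for the full data. Which exact call produced the output above is an assumption.

```r
# First ten rows of alcohol and quality, copied from the str() output above.
wine_head <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Pearson correlation test; df is the number of rows minus 2
# (1599 - 2 = 1597 in the output above, 10 - 2 = 8 here).
ct <- cor.test(wine_head$quality, wine_head$alcohol, method = "pearson")

ct$estimate   # sample correlation
ct$conf.int   # 95 percent confidence interval
ct$parameter  # degrees of freedom
```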
The biggest correlation is between alcohol and quality. Let’s plot the data:
For this plot only the values smaller than the 95% quantile of the y-axis (alcohol) were used to get a better view of the main values.
We can see a slight increase in alcohol levels for higher quality values. However, there are fewer data points available for the extreme quality levels (3 and 8). Thus, we have to be careful with our interpretations.
Let’s also draw box plots for the same variable combination:
With this plot the correlation is more visible. Especially for quality values between 5 and 8, the median alcohol percentage is increasing.
We can now also combine the two plots:
The combination of point and box plot shows us not only the major stats (the median, the mean as a red star, and the 25% and 75% quantiles) but also all the individual points and their distribution. Once again the most interesting part is the upward slope of the median and mean alcohol level.
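The numbers behind these plots can also be computed directly. The sketch below trims the data at the 95% quantile of alcohol and then computes the median and mean alcohol level per quality score. It uses only the first ten rows from the structure output, so it illustrates the steps, not the real values.

```r
# First ten rows of alcohol and quality, copied from the str() output above.
wine_head <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Keep only the rows below the 95% quantile of alcohol.
cutoff  <- quantile(wine_head$alcohol, 0.95)
trimmed <- subset(wine_head, alcohol < cutoff)

# Median and mean alcohol level for each quality score.
medians <- tapply(trimmed$alcohol, trimmed$quality, median)
means   <- tapply(trimmed$alcohol, trimmed$quality, mean)

rbind(median = medians, mean = means)
```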
The correlation values above also state a negative value for volatile.acidity. We can plot it to evaluate it further:
And we plot the box plot, too:
The two plots confirm what the correlation value states. For high quality wines the volatile.acidity value is lower than for bad wines.
Interestingly, the correlation value for citric.acid, another acid variable, is positive. The plot looks like this:
And again the box plot:
As we expected, the plots show a positive correlation between quality and citric.acid. The median citric.acid is the highest for quality 8 and lowest for quality 3.
We see that there are some variables that correlate with quality.
What is quite interesting is the fact that two acid variables seem to have contrary effects on the quality. While volatile.acidity has a negative influence, citric.acid has a positive effect. This was already hinted at in the variable information at the beginning. So it seems to be true that high values of volatile.acidity add an unpleasant taste, while citric.acid contributes a pleasant freshness.
pH and the acid variables (fixed.acidity, volatile.acidity and citric.acid) have absolute correlation values greater than 0.2. A correlation between these values intuitively makes a lot of sense, because pH is a measure of acidity. There is a negative correlation between pH and both fixed.acidity and citric.acid, which is logical: lower pH values represent higher acidity. However, there is a positive correlation between pH and volatile.acidity. This looks contradictory at first, but it is not necessarily an error. In the rows shown at the beginning, volatile.acidity is roughly ten times smaller than fixed.acidity, so it contributes little to the overall acidity, and the positive sign may simply reflect its relationship with the other acid variables.
As pH and the acid variables should represent the same basic concept (acidity), but volatile.acidity and citric.acid are more informative (higher correlation values for quality), we should use the latter two for our further modeling.
The highest correlation is between alcohol and quality with a value of 0.48. This confirms one of our assumptions in the beginning. It seems that high alcohol levels support the taste of wines.
The fact that most wines have a rating of 5 or 6 leaves few observations for the higher and lower quality values. Hence, we have to be careful when interpreting the correlation values. Often there are only a few values in the outer bins. We will keep this in mind for our further assessment.
Next, we want to look at multiple variables at a time. As we saw earlier, there were correlations between quality and the acid levels. Thus, we start by plotting volatile.acidity and citric.acid against quality:
We can see the vertical separation between the two different acids. Volatile.acidity (red) has bigger values and declines with increasing quality. Citric.acid (orange) on the other hand, has lower values overall and increases for higher quality scores.
Let’s plot the two acids against each other and color the points by quality:
There seems to be a slight tendency that good wines - 7 (purple) and 8 (pink) - lie in the bottom right area of the point cloud.
In order to see more detail, we can exclude the outliers and only plot the values below the 99% quantile:
There is still no clear separation of the color/quality of the points. But we can see a lot of blue points in the bottom right area and also some pink ones.
The problem is that there are far more values for bad and mediocre wines, than for the good ones.
To mitigate this, we can try to simplify the data by grouping the quality into just two classes, “good” and “bad”. All wines with a quality of less than 7 are considered bad, wines with a quality of 7 or 8 are considered good.
##
## bad good
## 1382 217
With this grouping we get 1382 bad and 217 good wines.
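A grouping like this can be sketched with `ifelse()` and `factor()`. The quality vector below is made up for illustration (one wine per observed score), and `quality.grouped` follows the column name that appears in the summary output further down.

```r
# Made-up example: one wine for each quality score that occurs in the data.
quality <- c(3, 4, 5, 6, 7, 8)

# Scores of 7 and above are "good", everything below is "bad".
quality.grouped <- factor(ifelse(quality >= 7, "good", "bad"),
                          levels = c("bad", "good"))

table(quality.grouped)
```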
In this plot the bad wines (light blue) appear to be distributed over the whole scale, although we see some clustering around very low citric.acid and higher volatile.acidity values. In contrast, there is a cluster of good wines (dark blue) at higher citric.acid and lower volatile.acidity levels.
The variable with the strongest correlation with quality is alcohol. Let’s have a look and plot it versus volatile.acidity:
Again, not a very strong separation but definitely some tendencies. We find a lot of green points on the left side (low alcohol level) and many blue points in the bottom right area (high alcohol, low volatile.acidity).
Let’s plot the data again, this time with the grouped quality:
We can see a big cluster of bad wine between roughly 9% and 10.5% alcohol and 0.2 to 0.8 g/dm^3 volatile.acidity. Most of the good wine, on the other hand, is found at alcohol levels above 10% and volatile.acidity levels below 0.6 g/dm^3.
Let’s also verify this by calculating the main stats for both quality groups and the variables of interest:
## # A tibble: 2 × 5
## quality.grouped mean_alc median_alc mean_vol.ac medain_vol.ac
## <fctr> <dbl> <dbl> <dbl> <dbl>
## 1 bad 10.25104 10.0 0.5470224 0.54
## 2 good 11.51805 11.6 0.4055300 0.37
Both the mean and the median alcohol level are higher for good wines: good wine has an average alcohol level of 11.52% vs. 10.25% for bad wines. The volatile.acidity level, on the other hand, is lower for good wines (both the mean and the median), e.g. a mean of 0.41 vs. 0.55 g/dm^3.
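Group statistics like these can be computed in several ways. The tibble output above suggests dplyr was used; the sketch below swaps in base R's `aggregate()` so that it runs without extra packages, and it uses four made-up rows instead of the wine data.

```r
# Made-up example rows: two bad and two good wines.
wines <- data.frame(
  quality.grouped  = c("bad", "bad", "good", "good"),
  alcohol          = c(9, 10, 11, 12),
  volatile.acidity = c(0.6, 0.5, 0.4, 0.3)
)

# Mean of both variables per quality group.
group_means <- aggregate(cbind(alcohol, volatile.acidity) ~ quality.grouped,
                         data = wines, FUN = mean)

# Median of both variables per quality group.
group_medians <- aggregate(cbind(alcohol, volatile.acidity) ~ quality.grouped,
                           data = wines, FUN = median)

group_means
```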
Based on what we saw, we can assume that there is some correlation between quality and the three input variables alcohol, volatile.acidity and citric.acid. We want to exploit this correlation and build a model to predict the quality of wines.
First, we extract a random sample (of length 100) from the data set which we can use later to verify our predictions.
Then, we build a linear model and add the three variables (alcohol, volatile.acidity and citric.acid) one after another.
##
## Calls:
## m1: lm(formula = quality ~ alcohol, data = subset(wine, !X %in% sample.x))
## m2: lm(formula = quality ~ alcohol + volatile.acidity, data = subset(wine,
## !X %in% sample.x))
## m3: lm(formula = quality ~ alcohol + volatile.acidity + citric.acid,
## data = subset(wine, !X %in% sample.x))
##
## =====================================================
## m1 m2 m3
## -----------------------------------------------------
## (Intercept) 1.823*** 3.029*** 2.999***
## (0.181) (0.191) (0.201)
## alcohol 0.366*** 0.320*** 0.320***
## (0.017) (0.017) (0.017)
## volatile.acidity -1.373*** -1.343***
## (0.098) (0.117)
## citric.acid 0.050
## (0.106)
## -----------------------------------------------------
## R-squared 0.230 0.319 0.319
## adj. R-squared 0.230 0.318 0.318
## sigma 0.712 0.670 0.670
## F 448.203 350.393 233.549
## p 0.000 0.000 0.000
## Log-likelihood -1617.168 -1525.506 -1525.394
## Deviance 759.256 671.855 671.753
## AIC 3240.336 3059.013 3060.787
## BIC 3256.274 3080.263 3087.350
## N 1499 1499 1499
## =====================================================
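The hold-out sample and the three nested models can be sketched as follows. The `subset(wine, !X %in% sample.x)` pattern is taken from the calls printed above; everything else is an assumption. In particular, `wine` here is a made-up data frame with arbitrary coefficients, so the resulting numbers do not match the table.

```r
# Made-up stand-in for the wine data (arbitrary coefficients and noise).
set.seed(42)
n <- 500
wine <- data.frame(
  X                = seq_len(n),
  alcohol          = runif(n, 9, 14),
  volatile.acidity = runif(n, 0.2, 1.0),
  citric.acid      = runif(n, 0, 0.8)
)
wine$quality <- round(3 + 0.3 * wine$alcohol -
                        1.3 * wine$volatile.acidity + rnorm(n, sd = 0.7))

# Draw 100 row ids for later verification and train on the remaining rows.
sample.x <- sample(wine$X, 100)
train    <- subset(wine, !X %in% sample.x)

# Add the three predictors one after another.
m1 <- lm(quality ~ alcohol, data = train)
m2 <- update(m1, . ~ . + volatile.acidity)
m3 <- update(m2, . ~ . + citric.acid)

# R-squared can only stay equal or grow when a predictor is added.
r2 <- sapply(list(m1 = m1, m2 = m2, m3 = m3),
             function(m) summary(m)$r.squared)
r2
```

Because R-squared never decreases when a predictor is added, adjusted R-squared, AIC or BIC are the more useful columns for judging whether the extra variable is worth it.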
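We get R-squared values of 0.230 for our first, 0.319 for our second and 0.319 for our third model. This means that alcohol contributes the most to our model. The first acid variable improves it further, but the second does not really add more value: the citric.acid coefficient (0.050 with a standard error of 0.106) is not significant, and AIC and BIC are slightly worse for m3 than for m2.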
Now that we have our model, we can use it to calculate predictions for our previously extracted sample. We then compare the predictions with the real quality values.
## fit lwr upr quality is.in.bounds diff
## 943 5.642661 4.327020 6.958302 7 FALSE -1.357339
## 1506 5.164385 3.848120 6.480650 3 FALSE 2.164385
## 133 6.493289 5.174705 7.811873 5 FALSE 1.493289
We find that of the 100 samples only 3 lie outside the 95% prediction interval (the lwr and upr columns).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.35700 -0.37200 0.14620 0.06675 0.48790 2.16400
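The mean signed difference between the prediction and the real quality value is 0.07 and the median difference is 0.15, so the model shows only a slight tendency to over-estimate quality. Because positive and negative errors cancel in this average, it does not describe the typical size of an error: the middle half of the differences lies between -0.37 and 0.49, and the largest miss is 2.16.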
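The verification step can be sketched with `predict()` and `interval = "prediction"`. The column names `is.in.bounds` and `diff` follow the output above. The data is again a made-up stand-in, and the choice of model (alcohol plus volatile.acidity) is an assumption, so the counts will not match the real ones.

```r
# Made-up stand-in for the wine data (arbitrary coefficients and noise).
set.seed(1)
n <- 500
wine <- data.frame(
  X                = seq_len(n),
  alcohol          = runif(n, 9, 14),
  volatile.acidity = runif(n, 0.2, 1.0)
)
wine$quality <- round(3 + 0.3 * wine$alcohol -
                        1.3 * wine$volatile.acidity + rnorm(n, sd = 0.7))

# Hold out 100 rows and fit the model on the rest.
sample.x <- sample(wine$X, 100)
holdout  <- subset(wine, X %in% sample.x)
m <- lm(quality ~ alcohol + volatile.acidity,
        data = subset(wine, !X %in% sample.x))

# 95% prediction intervals for the held-out rows (columns fit, lwr, upr).
pred <- as.data.frame(predict(m, newdata = holdout,
                              interval = "prediction", level = 0.95))
pred$quality      <- holdout$quality
pred$is.in.bounds <- pred$quality >= pred$lwr & pred$quality <= pred$upr
pred$diff         <- pred$fit - pred$quality

sum(!pred$is.in.bounds)  # number of wines outside their interval
mean(pred$diff)          # signed mean error (bias)
mean(abs(pred$diff))     # mean absolute error (typical size of a miss)
```

Reporting the mean absolute error next to the signed mean makes it visible when positive and negative errors cancel each other out.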
Based on our earlier findings, we plotted the highly correlating variables against each other and tried to find patterns describing the quality of wines. We were able to verify the inverse nature of the two acid variables volatile.acidity and citric.acid: While the first decreases the quality, the second increases it. We found clusters of good quality wine for low volatile.acidity levels and high citric.acid values.
When we plotted alcohol against volatile.acidity we found clusters of good wine for high alcohol percentages and low acid levels.
To summarize the findings, we can say that good wine tends to have a high alcohol percentage, low volatile.acidity and high citric.acid level. This resonates with what was stated in the data set information and with my personal intuitions.
With our three major variables (alcohol, volatile.acidity and citric.acid) we constructed a linear model and used it to predict the quality of wine. By extracting a sample from the whole data set before the model creation, we were able to test actual wines. The result of the model estimate was compared to the real quality value.
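In the end only 3 of 100 samples were outside our 95% prediction interval. With a 95% interval we would expect about 5 of 100 values to fall outside, so 3 misses are consistent with the model. Note, however, that the intervals are wide (roughly 1.3 quality points on either side of the estimate). Furthermore, the mean signed difference between estimate and actual quality was only 0.07 and the median difference was 0.15, which indicates little systematic bias.
The nature of a sensory and somewhat subjective variable like wine quality makes it quite hard to come up with a formal model. Against this background our results are quite satisfying, although the model explains only about 32% of the variance in quality (R-squared of 0.319).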
What could cause problems is the fact that we have no data for the more extreme quality scores (0, 1, 2, 9 and 10). Estimates for extreme values of our input variable are likely to be inaccurate. If there was a bigger data set containing more data, we could probably improve our model.
The histogram of quality was one of the first plots that we created. Before drawing it we expected the data to be more distributed over the whole scale (0 to 10). But actually there are only quality scores between 3 and 8. In addition, the majority of values is found for the scores 5 and 6.
In this jitter plot we can nicely see the contrary correlation of volatile.acidity and citric.acid to quality. The linear trend lines show the tendency that volatile.acidity levels decrease and citric.acid levels increase for good wine.
This plot shows us alcohol vs. volatile.acidity and colors good wines with dark blue. We can see the bad wines clustering on the left at low alcohol and higher acid levels.
The trend lines visualize the higher acid level of bad wines. For bad wines the acidity decreases slightly with higher alcohol percentages. The trend line for good wines, on the other hand, shows an upward slope. We might argue that with higher alcohol percentages, the acid level has a smaller impact on quality for these wines.
We can verify these findings with some statistics: The mean alcohol level for good wine is 11.52%, while it is 10.25% for bad wines. For volatile.acidity it is the other way around: the mean value of good wine is 0.4 g/dm^3 and 0.54 g/dm^3 for the bad ones.
The analysis of the red wine data set gave us some interesting insights.
At the beginning of our exploration we were surprised by the narrow distribution of the quality scores. When we think about it, we can probably explain it by the fact that all data is from the same kind of Portuguese wine (namely “Vinho Verde”). This made it a little hard to interpret the data and the correlations between individual variables. One way to mitigate this was to split the data into two groups and only distinguish between good and bad wine.
We found three variables that had high correlation values for quality and seemed interesting: alcohol, volatile.acidity and citric.acid. Alcohol had the strongest correlation with quality. We were able to verify this in point and box plots. The acid variables were quite interesting because they had inverse correlations with quality. Volatile.acidity seems to have a bad effect on quality, while citric.acid has a positive impact. A trend line showed this nicely in a plot containing both acids.
Finally, we combined our findings and created a linear model from the three variables. Due to the limited data, our expectations were moderate. However, the predictions for our samples showed little systematic bias: on average the signed difference between the quality estimate and the real score was only 0.07, and 97 of the 100 samples fell inside the 95% prediction interval.
It would be interesting to improve our model based on a bigger data set, for example with data from various wines from different countries.
Red Wine data set [Cortez et al., 2009]:
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009. ISSN: 0167-9236.
https://s3.amazonaws.com/udacity-hosted-downloads/ud651/wineQualityInfo.txt
Detail information about the wine data set: https://s3.amazonaws.com/udacity-hosted-downloads/ud651/wineQualityInfo.txt
ggpairs text size (hint from instructor): http://stackoverflow.com/questions/8599685/how-to-change-correlation-text-size-in-ggpairs
Matrix to data frame (as.data.frame): http://www.statmethods.net/management/typeconversion.html
Rename levels: http://stackoverflow.com/questions/29711067/r-how-to-change-name-of-factor-levels
Custom plot legend: http://stackoverflow.com/questions/10349206/add-legend-to-ggplot2-line-plot?noredirect=1&lq=1
Idea and sample code of the combined box/jitter plot and regression lines for final plot no. 3 provided by instructor.